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Why make a GPU Visualizer? 

• Pix and Razor Already Exist 

• Great for snapshots / single frames 

• Good for examining low level resources etc 

• Not so good for continuous capture or for 
investigating spikes (hard to capture spike 
frame) 





GPU Snooper Benefits 

•Realtime Live Profiling (no need to stop game or run 
analysis app) 

• Easy testing of Live scenario changes (recompile and hot- 
reload pixel shader - immediately see improvements) 

• Flipping binary switch in code and looking at live view 
differences 

• Correct Global Hierarchy Level (vs PIX for multiple 
contexts) 
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GPU Snooper 

• Quick Architecture Overview 

• History 

• Implementation 

• Performance and Memory Optimizations 

• Supporting multithreading and multiple contexts 

• Don't be scared - this is all "easy"* 
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NetherRealm Current GPU Profiler Architecture 


r i 

Multithreaded GPU 
Performance Recording 

L A 


r i 

Intermediate 

Format 

L A 

s 

r i 

Hierarchical 

Performance 

Display 

L. J 


t 

It ✓ 


Application 

5 S 

-► 

GFX Rendering Engine 
Code 

L A 

-► 

r i 

Frame Buffer / 
Rendering 
Output 

L A 


I U 


Frame Recording 
Manager 


Display Frame Ringbuffer 













GPU Snooper History 
as a Debugging Tool 





GPU Snooper was a Debug Tool 

• GPU Hangs were a Black Box © 

• Catch GPU Hangs and Stalls / Crashes 

• Originally designed with "printf" debugging 
early on in the XBOne/PS4 console development 
cycle before GPU crash dumps and debugging 

• Triggered by code (i.e. timer watchdog) or 
manually with console command 





GPU exception was detected. 

Rendering Thread (or GPU) Possibly hung. Waiting on label tor 123,42 seconds 


GPU SNOOPER - BEGIN INVESTIGATING TRACE 
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Implementing "printf" Debug 

• Create GPU Label 

• Simply memory accessible by GPU where we can write a 
value that is later read by CPU 

• PS4 == context. writeImmediateAtEndOfPipe() 

• Save GPU work trace 

• Enter / Exit Markup already done in code for PIX / RAZOR 

• Save "string" and Level 

• Write pointer to string in Label 





GPU Snooper "Debugger" 

(good for catching GPU hangs and crashes) 









GPU Snooper 
Transitions into 
a Profiling Tool 
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Request for Performance Timing? 

• So... add simple timing 

• Along with capturing strings (and level) also capture 
start / stop times 

• Turns out to be remarkably easy to do 





PS4 GPU Timer 


sourceCPU = see: :Gnm: : kEventWriteSourceGlobalClockCounter ; 
sourceGPU = see: :Gnm: : kEventWriteSourceGpuCoreClockCounter ; 
srcSelector = sourceGPU; 

GFXContext->m_dcb . writeAtEndOf Pipe ( 

see : : Gnm : : kEopFlushCbDbCaches , 
see : : Gnm : : kEventWriteDes tMemory , 

GPUC lock Time stamp r // Shared Onion Memory 
srcSelector , 

0 , 

see : : Gnm : : kCacheActionNone , 
see : : Gnm : : kCachePolicyLru 
) ; 





XBOne GPU Timer 


// Create D3D Query 
ID3DllQuery *timestamp; 

D3D11_QUERY_DESC QueryDesc; 

appMemzero (&QueryDesc, sizeof (D3D11_QUERY_DESC) ) ; 
QueryDesc . Query = D3D11_QUERY_TIMESTAMP; 

QueryDesc .MiscFlags = 0; 

pD3DDevice->CreateQuery ( &QueryDesc A & times tamp) ; 


// Inject Timestamp into GPU Context 
pD3DDeviceContext->End (timestamp) ; 





First Pass GPU Performance Profiling 









GDC 


GAME DEVELOPERS CONFERENCE March 14-18, 2016 Expo: March 16-18, 2016 #GDC16 



GPU Profiler ??? 

• So now we had timestamps 

• We're all done right? 

• By computing delta's we could see time per markup 

• Added timings to printf debugging 

• Wrong! Not Done © — Problems 

• One frame snapshot / Reading printf's is a pain 





Making GPU Snooper 
Profiling Useable With 
Visual Analysis 





Solution == GUI Visualizer 

• printf's -> hierarchical time graph 

• Simple very portable drawing code 

• Uses only 2 render primitives 

• Translucent Rectangles 

• Text / Strings 

• Optimization: Drawn as two batches / render calls 





NetherRealm First-Pass GPU Profiler Display 
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Hierarchical Chart Drawing 

• How did we draw the chart 

• Draw background and time markers for scale 

• Draw blocks that exceed a minimum time 

• Time is X value 

• Level (depth) is Y value 

• Limited depth to a user defined maximum 

• Have a small color separator on blocks 

• Truncate String based on size (minimum 3 characters) 

• Color of block based on quick hash of string 
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Visualizer == GPU Profiler 

• When easy to use, people use tool 

• Mostly GFX Programmers at first for optimization 

• But also artists and designers when looking at bog 
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Implementation Details: 

Game Side 





Implementation - Game Side 
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Game and Engine Modifications 

• Profiling Requires Manual Instrumentation 

• 99% chance you're already doing this 

• PS4 / Razor - pushMarker / popMarker 

• XB1 / PIX - BeginDrawEvent / EndDrawEvent 

• We use a scoped wrapper... 

• C++ object pushes marker in constructor / pops in destructor 

• Compiles out completely from retail builds 





Implementation Details: 
Performance Recording 





GPU Recording Implementation 
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Implementation 
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Implementation (Single Threaded) 
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Implementation Details: 
Capture Analysis 





Analysis of the Data 

• Parse Capture Buffer List for Events 

• Read Associated Timer for Start / Stop 

• Determine Hierarchy Level by Nesting 

• Transfer the data to a PerfSnapshot struct 
for visualization rendering routines 
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Implementation Details: 
Multithreading 





Multithreaded Implementation 

• Parallel context structures 

• Per thread GFX contexts 

• Single threaded because each thread gets 
their own version 

• Lock-Free Pool (SList) for Timers 

• Lock Accessed list of Snooper Data 
Structs 





Multithreading 

• Simplest version is just one main context 
calling deferred sub-contexts in a row and 
keeping a list of them. 

• But how do you handle allowing any 
context to call a deferred sub-context at 
any time? 





Multithreading GFX Contexts 
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Multithreading 

• At frame submission time, we will have an 
list of GPU Snooper Data structures that we 
can process (in order) 

• Analysis is exactly the same for MT as ST 
just potentially split across multiple 
"chunks" 





NetherRealm Second-Pass GPU Profiler Display 
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Implementation Details: 

Miscellaneous 





Optimizations 

• GPU performance impact very minimal 

• Optimizations are primarily on the CPU 
side (performance and memory) 

• Single-threaded code when possible (per context) 

• Lock Free Pools for Multithreaded Allocations 

• Byte-Packed Capture Buffers 

• Timers addressed by Index rather than Pointer 

Ilk 





Usability notes 

• Performed a couple days of tuning 
controls and GUI Visualizer 

• Getting zoom, movement, etc to feel natural 

• Example, original zoom was painful 

• Scale pan so it moves the same pixel speed regardless 
of zoom 

• Added "Key Chart" 





Implementation Details: 

Continuous 
Frame Recording 





Frame & Performance Recording 

• Keep around more Intermediate 
Performance Records (i.e. larger N) 

• Record Framebuffer Output 

• Simple Memory Ring Buffer of Scaled Output Frames 

• Can tie GPU Snooper capture to a frame 

• Allows freezing and scanning back & forth in time 





NetherRealm Current GPU Profiler Architecture 
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GPU Snooper: 
The Future ??? 
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Where do we go from here 

• Some future work already planned / 
sta rted 

• Additional ideas 





NetherRealm Current GPU Profiler Architecture 
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NetherRealm Profiler Architecture Future Goals? 
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NetherRealm Profiler Architecture Future Goals? 
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Adisak Pochanayon 

adisak@wbqames.com 


Twitter: @adisak 




GAME DEVELOPERS CONFERENCE March 14-18, 2016 Expo: March 16~l8 f 2016 #GOC16 


GDC 


<> 





S3.3 FPS 
<15.7 ms) 









•Click to edit Master text styles 

• Second level 

•Third level 
•Fourth level 
•Fifth level 




